Introduction

Visualizing statistics associated with taxa can be difficult due to the hierarchical nature of the information. Traditional graph types used to visualize the relationship between categories (e.g. taxa) and quantities (e.g. abundance) such as bar charts, pie charts, and box plots, are fundamentally two-dimensional. This means the it is usually necessary to only view a ‘slice’ of the data, such as the abundance of taxa of a particular rank, rather than all ranks. To demonstrate this idea, lets display the same data using various graphing techniques and compare their effectiveness. For this example we will be using a sample of 500 sequences from the UNITE fungal ITS database. First, we will use a bar chart to display abundance of taxa at each rank:

plot of chunk unnamed-chunk-3

For ranks with relatively few taxa, this is a satisfactory graphing technique, but it is ineffective once there is more than 20 or so taxa. It also is difficult to discern how sub-taxa are distributed within each taxon. For example, if we only looked at the phylum or class level, we can easily see that Basidiomycota/Agaricomyctes are the most plentiful, but don’t know if that is due to a single highly abundant species, or many moderately abundant species. These details are typically important for the interpretation of results.

Pie charts are also commonly used for this purpose, but they are just a less effective version of a bar chart:

plot of chunk unnamed-chunk-4

Metacoder approches this problem by using size and color to represent numeric data distributed along a phylogenetic tree:

plot of chunk unnamed-chunk-5

In this example, both size and color are being used to represent the abundance of taxa. Using this method, it is clear how child taxa are distributed within their parents and what taxa are unusual (note the unidentified agaricales).

Basic usage

Although there are many options that can be used to make highly customized graphs, plot_taxonomy only needs two arguments to function. These are a list of unique taxon IDs and a corresponding list of IDs of parent taxa (i.e. one rank lower). These two lists effectively constitute an adjacency/edge list, describing how taxa are connected in a phylogenetic tree. We can see the default appearance of the data used in the introduction using the code below. The taxa object is one of the two data.frames returned by the function extract_taxonomy, which is covered in another vignette.

head(taxa)
##      rank            name taxon_id parent_id item_count level
## 1 Kingdom           Fungi        1      <NA>        500     1
## 2  Phylum      Ascomycota        2         1        180     2
## 3   Class   Leotiomycetes        3         2         32     3
## 4   Order      Helotiales        4         3         28     4
## 5  Family Hyaloscyphaceae        5         4         25     5
## 6   Genus         Lachnum        6         5         17     6
plot_taxonomy(taxa$taxon_id, taxa$parent_id)

plot of chunk unnamed-chunk-6

Each vertex (i.e. circle) in the graph represents a taxon and each line represents its membership in a lower taxon.

Vertex/Line size

By default, the size of each vertex is inversely proportional to its rank. This makes higher ranked taxa smaller, making them easier to see. The size of vertices and lines can be scaled to a number associated with each taxon using the size parameter. Below, the number of sequences for each taxon is used to determine vertex size.

plot_taxonomy(taxa$taxon_id, taxa$parent_id,
              size = taxa$item_count)

plot of chunk unnamed-chunk-7

Note that it was not necessary to specify the absolute vertex size; the range of absolute vertex sizes is optimized for each graph so as to minimize overlap of vertices and maximize the ranges of sizes. The argument overlap_bias is used to determine how much overlaps are avoided. Higher values mean more importance is given to avoiding overlapping vertices than to maximizing the ranges of sizes. A high overlap_bias makes the connections between taxa more clear, but diminishes the visual effect of vertex size.

plot_taxonomy(taxa$taxon_id, taxa$parent_id,
              size = taxa$item_count,
              overlap_bias = 50)

plot of chunk unnamed-chunk-8

To low of an overlap_bias can make the graph hard to read:

plot_taxonomy(taxa$taxon_id, taxa$parent_id,
              size = taxa$item_count,
              overlap_bias = 1)

plot of chunk unnamed-chunk-9

Vertex color

The vertex_color argument works in a similar way to size. Numeric values are translated to a range of colors. Below the abundance of samples for each taxon is used to determine color instead of size.

plot_taxonomy(taxa$taxon_id, taxa$parent_id,
              vertex_color = taxa$item_count)

plot of chunk unnamed-chunk-10

The range of color used can be set using the vertex_color_series argument. This argument take a list of colors in the form of names, hex color codes, or integers.

plot_taxonomy(taxa$taxon_id, taxa$parent_id,
              vertex_color = taxa$item_count,
              vertex_color_series = c("turquoise", "darkorange3", "#4e567d", "gold", 23))

plot of chunk unnamed-chunk-11

Line color

Unlike size, the color of lines can be set independently of vertices, although the default behavior is for the lines to have the same color as the vertices. To only color vertices, you can set the lines to be a constant color:

plot_taxonomy(taxa$taxon_id, taxa$parent_id,
              vertex_color = taxa$item_count,
              line_color = "grey")

plot of chunk unnamed-chunk-12

Or vise-versa:

plot_taxonomy(taxa$taxon_id, taxa$parent_id,
              vertex_color = "grey",
              line_color = taxa$item_count)

plot of chunk unnamed-chunk-13

You can also set the color palette used for the lines in the same way as you set it for the vertex using the argument line_color_series.

Vertex labels

Labels can be added to vertices using the vertex_label option:

plot_taxonomy(taxa$taxon_id, taxa$parent_id, 
              vertex_label = taxa$name)

plot of chunk unnamed-chunk-14

Label sizes are proportional to vertex size. By default, only labels that are larger than some proportion of the total graph size are printed to avoid excessive crowding. The minimum size of labels that will be printed is controlled by the min_label_size option:

plot_taxonomy(taxa$taxon_id, taxa$parent_id,  
              vertex_label = taxa$item_count,
              min_label_size = 0.01)

plot of chunk unnamed-chunk-15

plot_taxonomy(taxa$taxon_id, taxa$parent_id,  
              vertex_label = taxa$item_count,
              min_label_size = 0.005)

plot of chunk unnamed-chunk-15

Note that the labels are a special kind that scales with the size of the graph. This means that the text size will always be proportional to the graph size regardless of ow big the graph is rendered; however, these special labels take more time to render, so causing too many to be printed drastically slow the rendering of the graph.

Line labels

Lines can be labeled as well using the line_label option, which works similarly to the vertex_label option:

plot_taxonomy(taxa$taxon_id, taxa$parent_id,
              line_label = taxa$name,
              min_label_size = 0.005)

plot of chunk unnamed-chunk-16

Usage examples

Plotting with multiple taxonomy roots

Sometimes a taxonomy has multiple roots. This occurs when there is not a common taxon all items are assigned to, like “Eukaryota” if all your items are associated with eukayotes. When plotting these type of data some might want all the data to share a common “life” root and be on the same tree, whereas others might think it more appropriate to plot each in different trees. metacoder allows for both of these, but defaults to plotting taxonomies with multiple roots as multiple trees.

Default plotting behavior for multiple taxonomic roots

Say someone emails you a list of ncbi ids that you want to plot. You could use the following code to get the taxonomic information.

raw <- "JQ086376.1 AM946981.2 JQ182735.1 CP001396.1 J02459.1 AC150248.3 X64334.1 CP001509.3 CP006698.1 AC198536.1 JF340119.2 KF771025.1 CP007136.1 CP007133.1 U39286.1 CP006584.1 EU421722.1 U03462.1 U03459.1 AC198467.1 V00638.1 CP007394.1 CP007392.1 HG941718.1 HG813083.1 HG813082.1 CP007391.1 HG813084.1 CP002516.1 KF561236.1 JX509734.1 AP010953.1 U39285.1 M15423.1 X98613.1 CP006784.1 CP007393.1 CU928163.2 AP009240.1 CP007025.1 CP006027.1 CP003301.1 CP003289.1 CP000946.1 CP002167.1 HG428755.1 JQ086370.1 CP001846.1 CP001925.1 X99439.1 AP010958.1 CP001368.1 AE014075.1 CP002212.1 CP003034.1 CP000243.1 AY940193.1 CP004009.1 JQ182732.1 U02453.1 AY927771.1 BA000007.2 CP003109.1 CP007390.1 U02426.1 U02425.1 CP006262.1 HG738867.1 U00096.3 FN554766.1 CP001855.1 L19898.1 AE005174.2 FJ188381.1 AK157373.1 JQ182733.1 U39284.1 U37692.1 AF129072.1 FM180568.1 CP001969.1 HE616528.1 CP002729.1 JF974339.1 AB248924.1 AB248923.1 CP002291.1 X98409.1 CU928161.2 CP003297.1 FJ797950.1 CP000038.1 U82598.1 CP002211.1 JQ806764.1 U03463.1 CP001665.1"
ids <- strsplit(raw, " ")[[1]]
contaminants <- extract_taxonomy(ids, regex = "(.*)", key = c(name = "item_id"))

In order to avoid using the NCBI servers more than necessary (and to avoid waiting), I saved the result of above code in an included example data set called contaminants. The code below will use the included data set if the code above is not run (you don’t need to run it to do the plotting).

taxa <- contaminants$taxa
plot_taxonomy(taxa$taxon_id, taxa$parent_id,
              size = taxa$item_count,
              vertex_color = taxa$item_count,
              titles = taxa$name)

plot of chunk unnamed-chunk-18

Spltting a taxonomy into multiple trees

You can also take a taxonomy with a single root and display it by splitting it at a specified level/rank.

file_path <- system.file("extdata", "unite_general_release.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
cat(names(sequences)[1])
## Lachnum_sp|JQ347180|SH189775.06FU|reps|k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Hyaloscyphaceae;g__Lachnum;s__Lachnum_sp
original_data <- extract_taxonomy(names(sequences),
                         regex = "^(.*)\\|(.*)\\|(.*)\\|.*\\|(.*)$",
                         key = c(name = "item_info", seq_id = "item_info",
                                 other_id = "item_info", "class_name"),
                         database = "none")
items <- original_data$items
taxa <- original_data$taxa
data_to_plot <- taxa
plot_taxonomy(data_to_plot$taxon_id, data_to_plot$parent_id,
              size = data_to_plot$item_count,
              vertex_color = data_to_plot$item_count)

plot of chunk unnamed-chunk-20

parts <- lapply(split_by_level(taxa$taxon_id, taxa$parent_id, level =  2), function(i) taxa[i, ])

data_to_plot <- parts[[2]]
plot_taxonomy(data_to_plot$taxon_id, data_to_plot$parent_id, no_stem = FALSE, 
              size = data_to_plot$item_count,
              vertex_color = data_to_plot$item_count,
              titles = data_to_plot$name)

plot of chunk unnamed-chunk-21

data_to_plot <- do.call(rbind, parts)
plot_taxonomy(data_to_plot$taxon_id, data_to_plot$parent_id,
              size = data_to_plot$item_count,
              vertex_color = data_to_plot$item_count,
              titles = data_to_plot$name)

plot of chunk unnamed-chunk-21